Primary Models
Three models are fitted to the data, each chosen so that its strengths mask the weaknesses of the others. K-nearest neighbours with kernel smoothing captures local similarity, not only geospatially but throughout the entire feature space. Gradient boosted decision trees are used for their robustness and successful track record. Finally, a generalised additive model is fitted: splines are first fitted to the features, and a linear combination of these nonlinear functions is then taken.
Before fitting, it is useful for the tree and neighbourhood models to encode the factor variables as numerics. Here they are encoded in order of their mean price. With target encoding it is often wise to regularise the ordering, for example with k-fold or expanding-mean encodings, but in this case, with only three levels and an obvious separation in price, a simple ordinal encoding is sufficient even though it departs from best practice.
The method feature tells a similar story; details are omitted.
ggplot(property_data,aes(x=type,y=log(price))) + geom_boxplot(fill='#0078D7')
#encode property type ordinally by mean price (Unit < Townhouse < House)
encode_type <- function(df) {
  df$type_encoded <- revalue(df$type,replace=c('Unit'='1','Townhouse'='2','House'='3')) %>%
    as.character %>% as.numeric
  return(df)
}
#encode sale method ordinally by mean price
encode_method <- function(df) {
  df$method_encoded <- revalue(df$method,replace=c('SP'='1','SA'='2','S'='3')) %>%
    as.character %>% as.numeric
  return(df)
}
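To illustrate the regularised alternative mentioned above, here is a minimal base-R sketch of expanding-mean target encoding (a hypothetical helper, not used in the model): each row is encoded with the mean target of the preceding rows that share its level, so no row ever sees its own target value.

```r
expanding_mean_encode <- function(level, y, prior = mean(y)) {
  enc <- numeric(length(y))
  sums <- list()
  counts <- list()
  for (i in seq_along(y)) {
    l <- as.character(level[i])
    s <- if (is.null(sums[[l]])) 0 else sums[[l]]
    n <- if (is.null(counts[[l]])) 0 else counts[[l]]
    #a level's first sighting falls back to the global prior
    enc[i] <- if (n == 0) prior else s / n
    sums[[l]] <- s + y[i]
    counts[[l]] <- n + 1
  }
  enc
}
#e.g. expanding_mean_encode(c('a','a','b','a'), c(1, 3, 10, 5))
#gives c(4.75, 1, 4.75, 2): the prior 4.75 for first sightings,
#then running means of earlier targets only
```

Because each encoding uses only earlier rows, the target leakage that makes naive mean encoding overfit is avoided.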
Gradient Boosted Trees
The model fitting begins with gradient boosted decision trees (GBDT), implemented with the xgboost package, an R API for the popular software library of the same name. GBDT is perhaps the most widely applicable and robust machine learning technique: it captures interactions and non-linear trends without explicit encoding. For more information, see the xgboost documentation.
Polar Coordinates
Looking at the map above, it appears that property value may be better parameterised with polar coordinates. Let’s set this up now with a function:
#----compute distance from city and bearings------------------------------------
polarise <- function(lnglat_centre,df) {
  locs <- select(df,c(lng,lat))
  #great-circle distance (metres) from the centre to each property
  df$dist_cbd <- apply(locs,1,function(loc){distm(loc,lnglat_centre)})
  #compass bearing from the centre to each property
  df$bearing <- apply(locs,1,function(loc){bearing(lnglat_centre,loc)})
  return(df)
}
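As a sanity check on the units, the haversine distance that distm() returns by default can be sketched in base R (using the same Earth radius, 6378137 m, that geosphere defaults to); values should agree closely with the package:

```r
#base-R haversine: great-circle distance in metres between two
#(lng, lat) points given in degrees
haversine_m <- function(p1, p2, R = 6378137) {
  to_rad <- pi / 180
  dlat <- (p2[2] - p1[2]) * to_rad
  dlng <- (p2[1] - p1[1]) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(p1[2] * to_rad) * cos(p2[2] * to_rad) * sin(dlng / 2)^2
  2 * R * asin(sqrt(a))
}
#a point 0.01 degrees east of the Melbourne CBD is roughly 880 m away
```

This confirms dist_cbd is measured in metres, which matters when eyeballing the ICE plots later.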
Note the loss function: the model is fitted to log(price) with a squared-error objective and RMSE as the evaluation metric, so errors are penalised multiplicatively in price terms.
A variety of features were experimented with; ultimately only a small subset was chosen for the model.
The number of trees was tuned via early stopping: once the validation score stops improving and starts to increase, training is stopped. Other hyperparameters, such as the maximum tree depth max_depth and the observation and feature subsampling ratios subsample and colsample_bytree, were tuned manually. After these were optimised, the learning rate was decreased to obtain a better fit.
#---------------------prepare data for xgboost api---------------------------
xgb_train0 <- encode_type(train0) %>% encode_method
#add polar coordinates
MELBOURNE_CENTRE <- c(144.9631,-37.8136)
xgb_train0 <- polarise(MELBOURNE_CENTRE,xgb_train0)
xgb_features <- c(
'building_area'
,'dist_cbd'
,'bearing'
,'year_built'
,'type_encoded'
,'nrooms'
,'land_area'
,'method_encoded'
,'nbathroom'
,'ncar'
,'ntrain_3000'
)
xgb_train0_y <- log(xgb_train0$price)
xgb_train0_x <- xgb_train0 %>% select(xgb_features)
#prepare the data in an xgboost specific data structure
Dtrain0 <- xgb.DMatrix(data=as.matrix(xgb_train0_x),label=xgb_train0_y)
#------------------------ setup parameters--------------------------------------
PARAMS <- list(
seed=0,
objective='reg:linear',
eta=0.005,
eval_metric='rmse',
max_depth=7,
colsample_bytree=0.78,
subsample=0.80
)
#------------------------cross validation---------------------------------------
set.seed(45785)
cv_obj <- xgb.cv(
params=PARAMS,
data=Dtrain0,
nthread=4,
folds=folds,
verbose=FALSE,
nrounds=1e+6,
early_stopping_rounds=2000
)
#save the number of trees that gives the best out of fold generalisation
OPTIMAL_NROUNDS <- cv_obj$best_ntreelimit
#-----------------------------plot----------------------------------------------
#plot training information
eval_log <- cv_obj$evaluation_log
ggplot(eval_log,aes(x=iter)) +
coord_cartesian(ylim=c(0,0.4)) +
geom_line(aes(y=train_rmse_mean),lwd=1,col='#0078D7') +
geom_line(aes(y=train_rmse_mean + train_rmse_std),lwd=1,col='#0078D7',lty=2) +
geom_line(aes(y=train_rmse_mean - train_rmse_std),lwd=1,col='#0078D7',lty=2) +
geom_line(aes(y=test_rmse_mean),lwd=1,col='orange') +
geom_line(aes(y=test_rmse_mean + test_rmse_std),lwd=1,col='orange',lty=2) +
geom_line(aes(y=test_rmse_mean - test_rmse_std),lwd=1,col='orange',lty=2)
#----------------generate out-of-fold (oof) predictions-------------------------
#initialise the out-of-fold predictions vector
xgb_oof_preds <- rep(NA,nrow(train0))
for (i in 1:KFOLD) {
fold <- folds[[i]]
#prepare fold specific training data
train0_x_fold <- xgb_train0_x[-fold,]
train0_y_fold <- xgb_train0_y[-fold]
Dtrain0_fold <- xgb.DMatrix(data=as.matrix(train0_x_fold),label=train0_y_fold)
#prepare fold specific validation data
val0_x_fold <- xgb_train0_x[fold,]
val0_y_fold <- xgb_train0_y[fold]
Dval0_fold <- xgb.DMatrix(data=as.matrix(val0_x_fold),label=val0_y_fold)
#fit the model
model <- xgb.train(params=PARAMS,
data=Dtrain0_fold,
nrounds=OPTIMAL_NROUNDS,
verbose=0)
#save out of fold predictions
preds <- predict(model,newdata=Dval0_fold)
xgb_oof_preds[fold] <- preds
}
#compute out of fold error
xgb_oof_error <- sqrt(mean((xgb_oof_preds-xgb_train0_y )^2))
#---------------------- train model on all train0-------------------------------
#Retrain the model on all of train0; its predictions on the ensemble
#validation set will be used to validate the ensembling meta-model
xgb_model <- xgboost(data = as.matrix(xgb_train0_x),
label = xgb_train0_y ,
params = PARAMS,
verbose=FALSE,
nrounds = OPTIMAL_NROUNDS
)
#----------------------------feature importance---------------------------------
#Plot feature importances for this model
xgb_fi <- xgb.importance(model=xgb_model)
xgb.plot.importance(xgb_fi,measure='Gain',main='Feature Importance (gain)')
#----------------create predictions for the ensemble fold-----------------------
#These predictions will be used to validate the ensembling meta model
xgb_ensemble_validation <- encode_type(ensemble_validation) %>% encode_method %>%
polarise(MELBOURNE_CENTRE,.)
xgb_ensemble_validation_y <- log(xgb_ensemble_validation$price)
xgb_ensemble_validation_x <- xgb_ensemble_validation %>%
select(xgb_features) %>% as.matrix
xgb_ens_preds <- predict(xgb_model,newdata=xgb_ensemble_validation_x)
Surprisingly, the generated OpenStreetMap features turned out to be almost irrelevant in modelling property value.
Below are some individual conditional expectation (ICE) plots. These are the same as partial dependence plots, only disaggregated: each thin line shows the estimated impact of the X variable on predicted log(price) for one individual observation. Visualising the data this way gives a better sense of the reliability of the plot and makes interactions easier to identify: if the thin lines behave differently at different values of X, then X may be interacting with another variable. More detail on ICE plots can be found in the ice package documentation. The highlighted line is the mean of the others, identical to a partial dependence plot.
Pay careful attention to the tick marks along the bottom of each plot, which show the distribution of data points along the X-axis.
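The relationship between ICE and partial dependence can be seen with a toy example (illustrative only, not the model above): averaging the ICE curves row-wise recovers the partial dependence curve, and differing slopes across the thin lines are the visual signature of an interaction.

```r
#three toy ICE curves over four grid values of X: one flat,
#one rising, one rising twice as fast (suggesting an interaction)
ice_curves <- rbind(c(2, 2, 2, 2),
                    c(1, 2, 3, 4),
                    c(0, 2, 4, 6))
#the partial dependence curve is simply the mean of the ICE curves
pdp <- colMeans(ice_curves)
#pdp is c(1, 2, 3, 4): the average completely hides the flat curve
```

This is exactly why the disaggregated view is more informative than the highlighted mean line alone.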
plot_xgb_ice <- function(model,X,y,feature) {
colour <- rep(rgb(0,0,0,0.05),nrow(X))
ice_obj = ice(model,
X=as.matrix(X),
y=y,
predictor=which(feature == names(X)),
frac_to_build=0.1)
plot(ice_obj,plot_orig_pts=F,colorvec=colour,xlab=feature)
}
plot_xgb_ice(xgb_model , xgb_train0_x,xgb_train0_y,'building_area')
plot_xgb_ice(xgb_model , xgb_train0_x,xgb_train0_y,'dist_cbd')
plot_xgb_ice(xgb_model , xgb_train0_x,xgb_train0_y,'bearing')
plot_xgb_ice(xgb_model , xgb_train0_x,xgb_train0_y,'year_built')
K-Nearest Neighbours
The Epanechnikov kernel was chosen ahead of time; it is reported to be optimal in a mean squared error sense. One limitation of the cross-validation here should be acknowledged: the features were chosen before the folds were evaluated, which can bias the validation scores optimistically, though with so few features the effect should be small.
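For intuition, here is a simplified one-dimensional sketch of kernel-weighted k-NN regression with the Epanechnikov kernel, roughly the scheme kknn implements (the helper names are hypothetical; kknn itself works in multiple dimensions and handles distance scaling internally):

```r
#Epanechnikov kernel: parabolic weight on [-1, 1], zero outside
epanechnikov <- function(u) ifelse(abs(u) <= 1, 0.75 * (1 - u^2), 0)

#predict at x0 as a kernel-weighted mean of the k nearest targets,
#scaling distances by the (k+1)-th neighbour so weights fade to zero
knn_epanechnikov <- function(x0, x, y, k) {
  d <- abs(x - x0)
  nn <- order(d)[1:k]
  bw <- sort(d)[min(k + 1, length(d))]
  w <- epanechnikov(d[nn] / bw)
  sum(w * y[nn]) / sum(w)
}
```

Nearer neighbours get more weight and anything at or beyond the bandwidth contributes nothing, which is what makes the fit smoother than a plain k-NN average.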
#---------------------------prepare data----------------------------------------
knn_train0 <- encode_type(train0)
knn_features <- c('building_area'
,'lng'
,'lat'
,'year_built'
,'type_encoded'
,'nrooms'
,'land_area'
)
knn_train0_y <- log(knn_train0$price)
knn_train0_x <- knn_train0 %>% select(knn_features)
#---------------Prepare for caret's optimisation--------------------------------
#caret's trainControl function demands training indices
#simply passing -folds won't work
train_folds <- lapply(folds,function(f){which(!(1:nrow(knn_train0) %in% f))})
#using folds created earlier, create an object to pass to caret
#for cross validation
trainingParams <- trainControl(index=train_folds,
indexOut=folds,
verboseIter=FALSE)
#create a dataframe of hyperparameter values to search
tunegrid <- data.frame(kmax=1:30,
distance=1,
kernel='epanechnikov'
)
#-----------------------------Run Caret-----------------------------------------
#setup parallel computing cluster
cl <- makePSOCKcluster(4)
registerDoParallel(cl)
#Find the optimal kmax hyperparameter
knn_model <- train(x=knn_train0_x,
y=knn_train0_y,
method='kknn',
metric='RMSE',
tuneGrid = tunegrid,
trControl = trainingParams
)
#-------------------------------------------plot--------------------------------
res <- knn_model$results
res <- select(res,c(kmax,distance,RMSE,MAE))
ggplot(res,aes(x=kmax)) + geom_line(aes(y=RMSE),lwd=2,color='#0078D7')
#save optimal hyperparameters
tuned <- data.frame(kmax=20,distance=1,kernel='epanechnikov')
#---------------Create out-of-fold (OOF) predictions----------------------------
knn_oof_preds <- rep(NA,nrow(knn_train0))
#iterate over folds
for (i in 1:KFOLD) {
fold <- folds[[i]]
#prepare fold specific data
train0_x_fold <- knn_train0_x[-fold,]
train0_y_fold <- knn_train0_y[-fold]
val0_x_fold <- knn_train0_x[fold,]
val0_y_fold <- knn_train0_y[fold]
#call caret's train() to find the optimal kmax
model <- train(x=train0_x_fold,
y=train0_y_fold,
method='kknn',
metric='RMSE',
tuneGrid=tuned,
trControl = trainControl('none')
)
#compute prediction on the remaining fold
preds <- predict(model$finalModel,newdata=val0_x_fold)
#add to the OOF predictions
knn_oof_preds[fold] <- preds
}
#stop the cluster
stopCluster(cl)
knn_oof_error <- sqrt(mean((knn_oof_preds-knn_train0_y)^2))
#--------------create predictions on the ensemble validation set----------------
knn_ensemble_validation <- encode_type(ensemble_validation)
knn_ensemble_validation_y <- log(knn_ensemble_validation$price)
knn_ensemble_validation_x <- knn_ensemble_validation %>% select(knn_features)
knn_ens_preds <- predict(knn_model,newdata=knn_ensemble_validation_x)
Generalised Additive Model
Now an additive model is fitted using the gam package. This model was fitted last, as it relies most heavily on an understanding of the data.
The model expression was constructed manually, the result of exploration followed by tuning of the splines' degrees of freedom. Simplicity was favoured over complexity, to avoid the overfitting that lengthy experimentation with the data can cause.
caret was not used this time; it has a habit of oversimplifying the fitting interface.
#---------------------------prepare data----------------------------------------
train0_gam <- polarise(MELBOURNE_CENTRE,train0)
gc <- gam.control(maxit=100)
form <- formula(
log(price) ~
type + #this is a factor
nrooms +
nbathroom +
factor(nbathroom):factor(nrooms) +
s(building_area_log,df=15) +
s(dist_cbd,df=30) +
s(bearing,df=28) +
s(year_built,df=8) +
s(land_area_log,df=8) +
nsupermarket_2000 +
ntrain_3000
#+ nschool_2000
#+ ncafe_1000
)
#---------------------------Generate out of fold predictions--------------------
#initialise
gam_oof_preds <- rep(NA,nrow(train0))
for (i in 1:KFOLD) {
fold <- folds[[i]]
train0_gam_fold <- train0_gam[-fold,]
val0_gam_fold <- train0_gam[fold,]
gam_model <- gam(formula = form ,
data = train0_gam_fold ,
#control = gc,
family='gaussian'
)
#add out of fold predictions for this fold
preds <- predict(gam_model,newdata=val0_gam_fold)
gam_oof_preds[fold] <- preds
}
#compute error
gam_oof_error <- sqrt(mean((gam_oof_preds-log(train0_gam$price) )^2))
#----------retrain the full model to generate features for the meta model-------
gam_model <- gam(formula = form ,
data = train0_gam ,
#control = gc,
family='gaussian'
)
#--------------Generate metafeatures for the ensembling meta-model--------------
#These predictions will be used to validate the ensembling meta model
gam_ensemble_validation <- encode_type(ensemble_validation) %>%
polarise(MELBOURNE_CENTRE,.)
gam_ens_preds <- predict(gam_model,newdata=gam_ensemble_validation)